(54) Method and apparatus for generating index data for search engines



(57) The invention relates to a method for generat- 
ing index data to be provided to a search engine to be 
used for searching the internet or a non-public network, 
said index data comprising one or more search indices, 
said method comprising the steps of: generating at least 
one search index on a computer based on data stored
on or accessible by said computer, said computer being
located remotely from said search engine, said index 
being generated in accordance with one or more set- 
tings or selectable options defining which data stored 
on or accessible by said computer is to be used for gen- 
erating said at least one index. 
[0012] Several search engines are capable of maintaining
and visualizing the index database's content in a
hierarchical (tree-like) catalog structure. Thus, a
complementary way of discovering information is available:
instead of keying in search terms, a user may navigate
step by step through the catalog's (sub)categories
(such as sports, finance, technology, news) in order
to find relevant pieces of information. Maintaining such
a catalog requires manual effort by the search engine's
host even if there is an internet frontend permitting
users to add catalog records online (today, no fully
automated solution is available on the market). It is
crucial to understand that these search engine catalogs
offer only very limited functionality in terms of search
parameters and datamodel flexibility: in essence, such a
catalog is not a fully-fledged structured datamodel (for
example for cars or holiday trips) but simply a
hierarchical navigation model that lets users work their
way down the navigation tree, starting at more general
categories (e.g. "sports") and ending up at more and more
specialized categories (e.g. "equipment for river rafting"
or "sports events in Atlanta").
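
The hierarchical navigation model described above can be sketched as a tree of nested categories. This is only an illustration: the category names are taken from the examples in the text, while the URLs, the `CATALOG` layout and the function names are assumptions made for the sketch.

```python
# Hypothetical catalog: nested dicts are categories, lists are leaf records (URLs).
CATALOG = {
    "sports": {
        "equipment": {
            "river rafting": ["http://example.com/rafting-gear"],
        },
        "events": {
            "Atlanta": ["http://example.com/atlanta-events"],
        },
    },
    "finance": {},
}


def navigate(catalog, path):
    """Walk down the category tree, one step per path element."""
    node = catalog
    for category in path:
        node = node[category]  # raises KeyError for an unknown category
    return node


def subcategories(catalog, path):
    """Return the category names a user could select next, sorted for display."""
    node = navigate(catalog, path)
    return sorted(node) if isinstance(node, dict) else []
```

Such a structure supports stepping from "sports" down to "river rafting", but it carries no typed attributes (price, date, destination) that a query could filter on, which is the limitation the paragraph points out.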

[0013] Another approach to storing searchable data in
a structured catalog on a search engine is to handle
structured data extracted from webpages that contain,
for example, data in tabular form (e.g. a price list for
products offered on a specific webpage). In this scenario
pieces of structured data are transferred from the
webserver to the search engine in one of two ways:

[0014] The structured data contained in webpages that
are to be indexed can be copied manually by the staff
that runs the search engine. However, this is a
time-consuming and error-prone way of maintaining the
search engine's database. Thus, a common approach is to
write some sort of "scanner program" that loads and
analyzes webpages containing structured data. This must
be done on the side of the search engine. In this way the
process of retrieving and updating structured data can be
automated, but only on a per-webpage basis, which is the
main drawback: since millions of webpages contain
structured data in all kinds of formats and models, there
is no way of covering even a small share of them as long
as one scanner program has to be written for each webpage
whose structured contents are to be analyzed and stored
on the search engine.
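
To make the per-webpage limitation concrete, the following is a minimal sketch of such a "scanner program", assuming a page whose price list is a plain HTML table with a header row. The class name, the sample page and its column names are hypothetical; a page with a different layout would need a different scanner.

```python
from html.parser import HTMLParser


class PriceListScanner(HTMLParser):
    """Collects the cell texts of every <tr> in a page as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scan_price_list(html):
    """Turn the table into one dict per data row, keyed by the header cells."""
    scanner = PriceListScanner()
    scanner.feed(html)
    scanner.close()
    header, *records = scanner.rows  # assumes the first row is the header
    return [dict(zip(header, record)) for record in records]


# Hypothetical page content used only for illustration.
PAGE = """
<table>
  <tr><th>product</th><th>price</th></tr>
  <tr><td>Paddle</td><td>49.00</td></tr>
  <tr><td>Helmet</td><td>35.50</td></tr>
</table>
"""
```

The sketch hard-codes the assumption that the first table row names the columns; this kind of page-specific knowledge is what prevents one scanner from covering many sites.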
[0015] All of the conventional approaches known in the
prior art, the indexing concept as well as the catalog
concepts, suffer from substantial disadvantages.

Summary of the invention 

[0016] It is an object of the present invention to pro- 
vide an improved method and apparatus for generating 
searchable data to be stored on and made available 
through search engines. 

[0017] According to an aspect of the present invention,
at least one search index to be used by a search
engine when searching the internet is generated on a
computer located remotely from the host on which the
search engine is located. This removes the responsibility
of generating search(able) index data from the host
or operator of the search engine. Instead, this task is
performed on the computer where the data which is later
to be found through the search engine is located.
[0018] With such a configuration the host, owner or
operator of the computer on which the search index data
is generated may directly influence the contents of the
search index to be maintained by the search engine, and
can thereby increase the likelihood that the data he
wishes to be found by network users conducting search
operations will actually show up as a query result. A
software module, hereinafter called an indexer module,
running on the computer which hosts the data to be
indexed may generate the search index data, and this data
may then be transferred to the search engine host, where
it is incorporated into or joins the search(able) index
already present there.
[0019] Furthermore, such a configuration can prevent
inconsistent query results, e.g. a link that is broken
(i.e. invalid) or a link that leads to a web page
offering information on other topics than it did when it
was scanned by the search engine. With indexing performed
remotely from the search engine, control of index and
catalog data is transferred to the information source
(= host of an internet website); in other words, the
search engine no longer decides what contents are to be
gathered.

[0020] Moreover, such a configuration enables automated
support not only for static web pages but also for
all kinds of dynamically generated web pages, since the
index generation is carried out on the computer where
the dynamic web application is running and not on the
search engine host.

[0021] The computer running the indexer module and
thus building the index data may be a webserver or any
content server connected to the internet or to a
non-public network; it may also be a normal computer
on which a web application is running or on which any
internet content can be stored. For example, a user of
the index generating computer may have his own website
located on another computer, such as the server of his
internet provider where his webspace is located, but for
generating the index he may simply transfer his website
or web application down to his own computer and carry out
the generation of the index there. For that purpose a
crawler may be provided in the indexer module for
retrieving the internet based content, such as a website,
from the remote server in order to then build the index
data.

[0022] The indexer module may be installed on a webserver
as an add-on. The index thereby built is then
transferred ("pushed") from the originating webserver to
the search engine, where it joins the already existing
index database.
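
As a rough illustration of what an indexer module does locally before pushing its result, the sketch below builds a term-to-URL index from content held on the indexing computer and honours a setting that selects which content is indexed. The page contents, the `include_prefixes` setting and the tokenization rule are assumptions made for the example; the text does not prescribe any of them.

```python
import re
from collections import defaultdict

# Hypothetical local content: URL -> page text held on the indexing computer.
PAGES = {
    "http://example.com/rafting": "River rafting equipment and guided tours",
    "http://example.com/private/notes": "Internal notes on rafting suppliers",
    "http://example.com/tickets": "Tickets for sports events",
}

# Hypothetical operator settings: only URLs under these prefixes are indexed.
SETTINGS = {
    "include_prefixes": [
        "http://example.com/rafting",
        "http://example.com/tickets",
    ]
}


def build_index(pages, settings):
    """Return a mapping: search term -> sorted list of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        if not any(url.startswith(p) for p in settings["include_prefixes"]):
            continue  # excluded by the operator's settings
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
    return {term: sorted(urls) for term, urls in index.items()}
```

Because the selection happens on the content owner's side, excluded pages never reach the search engine at all; only the resulting index would be pushed.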
[0037] Unstructured data = textual information; e.g.
used in conjunction with so-called fulltext searches;
when being indexed, unstructured data may be extended
to incorporate additional information such as document
(webpage) owner, location or timezone;
[0038] Single index = index comprising unstructured 
data; the index may be extended to incorporate addi- 
tional information such as document (webpage) owner, 
location or timezone; 

[0039] Structured data = contents available in a tab- 
ular, relational or object-oriented manner; tabular data 
can be found for example within static or dynamically 
generated webpages; relational data is typically stored 
in and retrieved from relational database management 
systems (RDBMS); object-oriented data is typically 
stored in and retrieved from object-relational or object- 
oriented database management systems; 
[0040] Plurality of indices = collection of indices; 
[0041] Group of indices = plurality of indices belong- 
ing to the same context, i.e. describing a specific model 
of structured data (e.g. the set of attributes specifying a 
specific topic such as "Journeys" or "Sports Events" or 
any other topic) 

[0042] In the context of this document index data is 
generated for use by search engines on the internet or 
on non-public networks. Thus, index data is a special 
representation of structured and / or unstructured con- 
tents. The index will be used to differentiate between, 
search for, and locate information resources which are 
made available through the internet or on a non-public 
network. Today, each individual resource of information 
(e.g. a specific webpage) can be addressed by its 
unique URL (= uniform resource locator). 
[0043] Web resource = any webpage or document or
piece of data accessible through the internet or on a
non-public network by means of a URL;
[0044] Search index = index data and / or searchable
index database maintained by and accessible through
a search engine on the internet or on a non-public
network; the index database may consist of multiple
search indices supporting various datamodels. Each
datamodel maps to a specific category of searchable
content (e.g. "unstructured text", "IT components",
"Books", "Movies", "Tickets");

[0045] Searchable index = synonym for search index;

[0046] Index database = database storing informa- 
tion which is originally transmitted to the search engine 
using index data; the database may support unstruc- 
tured data and / or structured (i.e. tabular, relational, ob- 
ject-oriented) data; 

[0047] Catalog database = particular characteristic 
of an index database supporting only structured (i.e. tab- 
ular, relational, object-oriented) data in the context of a 
search engine based on push indexing; 
[0048] Content server = server system making available
various kinds of content resources (e.g. MPEG
movies, HTML webpages, Acrobat PDF documents) on
the internet or on a non-public network;
[0049] A first embodiment of the present invention will 
be described with reference to Fig. 3. 
[0050] As soon as the indexer module is installed as
a webserver extension, it can be used to generate index
data for all content that is made available through the
webserver. The indexer module reads the webserver
configuration and thus determines which virtual paths
and virtual servers are available. Index data is then
built (1) according to the administrator's settings
(selection of static pages and formats, index update
schedule, regional classification, etc.). Indexing the
contents of dynamically built web pages potentially
containing structured data may involve writing some
program code in a simple script language (or any other
suitable programming language supported by the indexer
module). Since structured data typically resides in
relational databases, the indexer module is capable of
connecting to ODBC and JDBC datasources. In conjunction
with an easy-to-learn tag-based programming language,
any kind of database content may be retrieved and added
to the originating index.

[0051] Schematically this can be done by defining the
following in a script language:

a) define data fields (and/or data sources) to be
accessed

b) retrieve data from those data fields

c) convert format of retrieved data to match the
format required by the index of the search engine

d) define further classification of index data

e) define update interval

[0052] The individual steps are now explained by way of
example in connection with the generation of index data
relating to literature. Step a) then defines, for
example, that the data fields author, title and price
are to be accessed and indexed. In step b) the data is
retrieved. Step c) may define the calculation formula
necessary to convert the price from the currency used in
the accessed database to the currency used in the index
of the search engine. In step d) it may additionally be
defined that the index data so generated has the regional
classification "Germany", which means that a search in
the search engine using the regional classification
"France" will not treat the index data so generated as
a hit. Finally, step e) defines the interval at which
the index is to be updated.
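
Steps a) to e) for the literature example might look roughly as follows. This is a sketch under several assumptions that are not in the text: an in-memory `sqlite3` database stands in for the ODBC/JDBC datasource, Python stands in for the tag-based script language, the currencies (DEM converted to EUR at the fixed rate 1.95583) are chosen purely for illustration, and all table, field and function names are hypothetical.

```python
import sqlite3

# Stand-in for the bookstore's datasource (assumption: sqlite3, in memory).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (author TEXT, title TEXT, price_dem REAL, url TEXT)")
db.execute(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    ("J. W. Goethe", "Faust", 19.90, "http://example.com/books/faust"),
)

FIELDS = ("author", "title", "price_dem", "url")  # step a) fields to access
DEM_PER_EUR = 1.95583                              # step c) conversion formula
REGION = "Germany"                                 # step d) regional classification
UPDATE_INTERVAL_HOURS = 24                         # step e) hypothetical schedule


def build_catalog_entries(connection):
    """Read the selected fields and convert them to the index's format."""
    rows = connection.execute(f"SELECT {', '.join(FIELDS)} FROM books")  # step b)
    entries = []
    for author, title, price_dem, url in rows:
        entries.append({
            "author": author,
            "title": title,
            "price_eur": round(price_dem / DEM_PER_EUR, 2),  # step c)
            "region": REGION,                                 # step d)
            "update_interval_hours": UPDATE_INTERVAL_HOURS,   # step e)
            "url": url,
        })
    return entries


def search(entries, region):
    """A query restricted to one region only hits entries classified with it."""
    return [entry for entry in entries if entry["region"] == region]
```

With the single sample record, a search restricted to "France" returns nothing, while a search restricted to "Germany" returns the Faust entry, which mirrors the behaviour described for step d).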

[0053] In this example a bookstore may, as it sees fit,
generate index data by accessing its own database and
then send the index data so generated to a search engine
host, where it is incorporated into the search index of
the search engine.
[0054] If steps c) and d) are omitted, the index data so
generated may be used in the most general (unstructured)
index possible, in which each entry consists only of a
search term and the corresponding URL. The retrieved
data is simply indexed and then sent to the search

catalog data the search engine's loader module again
creates and stores a set of billing data first (3a). Again,
billing data is returned to the indexer module which
originated the transfer (3b). Then the loader module
separates the index data from the catalog data. While the
former is placed in the index queue, the latter is added
to the catalog queue (3c). Due to the different nature of
unstructured index data and structured catalog data, two
separate databases are used. Accordingly, an additional
catalog scanner (4b) is required since the index scanner
(4a) only handles index data.
[0068] From that moment on, both the index data and
the catalog data can be retrieved by internet users who
perform queries through this particular search engine's
internet frontend with a web browser (5). Queries may be
run against only the index database, only the catalog
database or both databases simultaneously (6a, 6b).
Result sets returned by the databases (7a, 7b) are merged
if necessary. The combined set of results is then
transmitted back to the web browser where the query
originated (8). The internet user is then likely to find
the desired pieces of information on one of the web pages
whose URLs are contained in the search results (9).
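
The separation into two queues and the merging of result sets can be sketched as follows. The `kind` field used to tell index records from catalog records, and the choice to drop duplicate URLs while merging, are assumptions of this sketch; the text only says that result sets are merged if necessary.

```python
from collections import deque

index_queue = deque()    # unstructured index data awaiting the index scanner
catalog_queue = deque()  # structured catalog data awaiting the catalog scanner


def load(transfer):
    """Split one incoming transfer into the two queues (cf. step 3c)."""
    for record in transfer:
        if record["kind"] == "index":  # "kind" is a hypothetical marker field
            index_queue.append(record)
        else:
            catalog_queue.append(record)


def merge_results(index_hits, catalog_hits):
    """Combine both result sets, keeping order and dropping duplicate URLs."""
    seen, merged = set(), []
    for url in [*index_hits, *catalog_hits]:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged
```

Keeping the queues separate reflects the two databases and two scanners described above, while the merge step gives the user a single list regardless of which database produced a hit.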

[0069] Of course it is also possible to generate and
transmit catalog data only, without generating and
transmitting index data.

[0070] In the foregoing embodiments a technology for
generating index data remotely from a search engine has
been described. This results in several significant
advantages over the prior art approach.
[0071] For example, there is provided an enabling
technology for realizing not only a technologically new
and inventive approach but also a new business model:
since remote indexing decisively increases the level of
quality offered by search engines, usage fees can be
easily justified; any website host running an indexer
module will e.g. be billed according to the number of
URLs and / or catalog entries selected for storage on
the search engine and their respective update intervals.
[0072] Search engines based on the remote indexing
approach are fully compliant with eBusiness applications
and information brokerage applications (as opposed to
common search engines) since they can easily handle
dynamically generated webpages containing structured
and/or unstructured content.

[0073] Furthermore, remote indexing is much more
bandwidth-efficient than the method employed by common
search engines because it does not require transmission
of complete web pages between a website and the search
engine. Typically, index data shrinks to about 40% of
the original page's size. To further reduce the amount
of time (and bandwidth) needed to transfer updates,
index and catalog data may be packed (compressed) prior
to transmission.
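
Packing index data before transmission could be done along these lines. The sketch assumes JSON as the serialization and zlib as the compressor, neither of which is named in the text, and the function names are hypothetical.

```python
import json
import zlib


def pack(index_data):
    """Serialize index data and compress it for transmission."""
    raw = json.dumps(index_data, sort_keys=True).encode("utf-8")
    return zlib.compress(raw, 9)  # 9 = strongest compression level


def unpack(payload):
    """Inverse of pack(), as the receiving search engine would apply it."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))
```

The achievable saving depends on how repetitive the index data is; the sketch makes no claim about typical ratios.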

[0074] Although not expressly mentioned before, a
catalog host may also be implemented on its own (that
is, without the index-handling part dealing with
unstructured index data). Today, none of the large,
well-known internet search engines processes structured
catalog data alone, but this may become a promising
approach in the future.

[0075] It is also conceivable that a hybrid configuration
of the conventional technology and the technology
of the present invention is employed. E.g. a
conventionally generated search index may be freely
accessed and searched by a user, while a search index
generated according to the present invention can only be
accessed if the user has agreed to be charged for it.
[0076] Communication between indexer modules and
the remote indexing search engine may be implemented
using a TCP/IP-based protocol. Although standard
protocols such as HTTP or FTP could be used, defining
a variation thereof or even introducing a completely
new protocol may turn out to be reasonable for applying
the present invention.
[0077] The data format used to describe index and 
catalog data may be in the form of XML or any of its 
derivatives. 
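
An XML representation of index data might look like the output of the sketch below. The element and attribute names (`indexdata`, `entry`, `term`, `url`, `region`) are invented for illustration; the text does not define a schema.

```python
import xml.etree.ElementTree as ET


def to_xml(index, region):
    """Serialize a term -> URLs mapping into a hypothetical XML layout."""
    root = ET.Element("indexdata", {"region": region})
    for term in sorted(index):
        entry = ET.SubElement(root, "entry", {"term": term})
        for url in index[term]:
            ET.SubElement(entry, "url").text = url
    return ET.tostring(root, encoding="unicode")


def from_xml(document):
    """Parse the layout produced by to_xml() back into a mapping."""
    root = ET.fromstring(document)
    return {
        entry.get("term"): [u.text for u in entry.findall("url")]
        for entry in root.findall("entry")
    }
```

A document of this kind could be sent as the body of whichever TCP/IP-based protocol is chosen for the transfer, and parsed by the loader module on the search engine side.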

[0078] Apart from regional classification of index or 
catalog data any other conceivable classification is pos- 
sible as well. 

[0079] It is readily apparent to the expert from the
foregoing that the embodiments described above are
exemplary only and can easily be modified or
supplemented without departing from the spirit and scope
of the present invention. E.g. if necessary, index
and/or catalog data might be encrypted prior to
transmission. Furthermore, it should be clear that the
elements of the embodiments described above may be
realized by means of software, by means of hardware, or
by a combination of both.



Claims 

1. A method for generating index data to be provided
to a search engine to be used for searching the
internet or a non-public network, said index data
comprising one or more search indices, said method
comprising the steps of:

generating at least one search index on a
computer based on data stored on or accessible by
said computer, said computer being located
remotely from said search engine, said index being
generated in accordance with one or more settings
or selectable options defining which data stored on
or accessible by said computer is to be used for
generating said at least one index.

2. A method according to claim 1, further comprising:

transmitting said index from said remote computer
to said search engine to enable the search
engine to incorporate the thus transmitted index into
one or more of its search indices used for searching

17. A method according to claim 16, further comprising
one or more of the following:

billing a search engine user for carrying out a said
search method;

billing the host and/or operator and/or owner
of said computer, the contents of which are indexed
and in turn transmitted to said search engine, for
making said index data available to network users
through said search engine.

18. An apparatus for generating index data to be
provided to a search engine to be used for searching
the internet or a non-public network using one or
more search indices, said apparatus comprising:

means for generating at least one search index
on a computer based on data stored on or
accessible by said computer, said computer being
located remotely from said search engine, said index
being generated in accordance with one or more
settings or selectable options defining which data
stored on or accessible by said computer is to be
used for generating said at least one search index.

19. An apparatus according to claim 18, further
comprising:

means for carrying out a method according to
any one of claims 2 to 17.

20. A computer program comprising computer executable
instructions for causing a computer to carry out
a method according to any of claims 1 to 17.

21. A data structure comprising:

at least one search index which can be used
by a search engine and which has been generated
by a method according to one of claims 1 to 17.